Text Representation for Efficient Document Annotation

Authors

  • Christin Seifert
  • Eva Ulbrich
  • Roman Kern
  • Michael Granitzer
Abstract

In text classification, the amount and quality of training data are crucial for the performance of the classifier. Training data is generated by human labellers, which is tedious and time-consuming work. To reduce the labelling time for single documents, we propose to use condensed representations of text documents instead of the full-text document. These condensed representations are key sentences and key phrases, and they can be generated in a fully unsupervised way. We extended and evaluated the TextRank algorithm to automatically extract key sentences and key phrases. For representing key phrases, we propose a layout similar to a tag cloud. In a user study with 37 participants, we evaluated whether human labellers can label documents faster and equally accurately with these condensed representations. Our evaluation shows that the users labelled tag clouds twice as fast and as accurately as full-text documents. While further investigations for different classification tasks are necessary, this insight could potentially reduce costs for the labelling process of text documents.
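
The unsupervised key-term extraction mentioned in the abstract can be illustrated with a minimal sketch of the basic TextRank idea: build a word co-occurrence graph and rank its nodes with a PageRank-style iteration. This is not the authors' extended algorithm. The stop-word list, the window size, the damping factor, and all function names (`tokenize`, `build_graph`, `textrank`, `key_terms`) are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "are", "on", "with", "by", "be", "can", "this", "that", "as", "it"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_graph(words, window=2):
    """Undirected co-occurrence graph: link distinct words at most `window` positions apart."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """PageRank-style scoring on the undirected graph (fixed number of iterations)."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # Each neighbour shares its score equally among its own neighbours.
            rank = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) + damping * rank
        scores = new_scores
    return scores


def key_terms(text, top_n=5, window=2):
    """Return the `top_n` highest-scoring words; ties are broken alphabetically."""
    scores = textrank(build_graph(tokenize(text), window))
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    sample = ("Text clustering and text classification are two main tasks "
              "of text mining. Text labelling is time-consuming.")
    print(key_terms(sample))
```

In a tag-cloud style layout, the scores returned by `textrank` could plausibly drive the font size of each term, although the mapping used in the paper is not described here.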

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large amount of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised grouping of similar documents, is one of the important techniques of text mining. The most important step...

Document Representation Methods for Clustering Bilingual Documents

Globalization places people in a multilingual environment. A growing number of users access and share information in several languages for public or private purposes. In order to deliver relevant information in different languages, efficient multilingual document management is worthy of study. Generally, classification and clustering are two typical methods for document management....

A Flexible Representation of Heterogeneous Annotation Data

This paper describes a new flexible representation for the annotation of complex structures of metadata over heterogeneous data collections containing text and other types of media such as images or audio files. We argue that existing frameworks are not suitable for this purpose, most importantly because they do not easily generalize to multi-document and multimodal corpora, and because they of...

Exploiting Coreference Annotations for Text-to-Hypertext Conversion

The paper describes an annotation scheme for coreference developed within the application context of text-to-hypertext conversion. In this context, coreference is used (1) for generating document-internal and cross-document hyperlinks, and (2) for resolving anaphoric expressions in order to achieve cohesive closedness in hypertext nodes. We will argue that for the purpose of cross-document linking...

Journal:
  • J. UCS

Volume 19

Publication date: 2013